Analysing The Impact of a New E-commerce Web Page on Conversion Rate (A/B Testing)¶

In this project we run an A/B testing analysis using the Chi-Square test for independence to assess the impact of a new e-commerce web page on the user conversion rate, compared with the old web page.

Project Summary¶

Context¶

An e-commerce company has developed a new web page in order to try and increase the number of users who "convert," meaning the number of users who decide to pay for the company's product.

The users were randomly divided into two groups: the treatment group had access to only the new web page and the control group had access to only the old web page.

The company wants to understand whether it should implement the new page, keep the old page, or perhaps run the experiment longer before making a decision.

Actions¶

To perform our analysis, we first explored and cleaned the data. We found 3,895 duplicated user_id values: some of these users were exposed to both the new and old web pages, and some appeared in both the treatment group and the control group. These entries made up about 1.32% of the entire dataset. For the purpose of our analysis, we excluded the mismatched data to resolve the disparity.

To compare the conversion rates of the two groups, we used the Chi-Square test for independence. To run the test, we set the following hypotheses and alpha level:

Null Hypothesis = There is no relationship between landing page type and conversion rate.
Alternate Hypothesis = There is a relationship between landing page type and conversion rate.
Alpha = 0.05

Results & Conclusion¶

We performed the analysis and obtained the following results:

users from the control group had a conversion rate of 12.03%
users from the treatment group had a conversion rate of 11.88%

We noticed that users of the old web page had a slightly higher conversion rate than users of the new web page. We then performed the Chi-Square test.

The results obtained from the Chi-Square test gave us a p-value of 0.19 which is greater than our alpha value of 0.05 and a chi-square statistic of 1.71 which is less than our critical value of 3.84.
Hence we failed to reject the Null Hypothesis and concluded that there was no significant difference between the conversion rates of the old landing page and the new landing page.

In conclusion, our analysis found no evidence that the new web page converts better, so we did not recommend replacing the old web page at this time.
We recommended that the company run the experiment again to collect more data and make sure users in each group only access the web page they are supposed to access. Doing so may provide more insight.

Loading Required Packages¶

In [ ]:

# load the required python libraries
import pandas as pd
from scipy.stats import chi2_contingency, chi2

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style("whitegrid")

import warnings
warnings.filterwarnings('ignore')

Data Description & Preparation¶

We will load the data collected by the company after the experiment and have a preview of the first 5 rows. This will give us a feel of how the data is stored. We will then describe the data to have an idea of the full structure.

In [ ]:

# load the data & check the first 5 rows 
df = pd.read_csv('./campaign_data.csv')
df.head()

Out[ ]:

	user_id	timestamp	group	landing_page	converted
0	851104	11:48.6	control	old_page	0
1	804228	01:45.2	control	old_page	0
2	661590	55:06.2	treatment	new_page	0
3	853541	28:03.1	treatment	new_page	0
4	864975	52:26.2	control	old_page	1

In [ ]:

# describe the entire dataset after excluding the timestamp
df1 = df.drop('timestamp', axis=1)
# numeric summary first, then row counts per level for each categorical column
summary_list = [df1.describe().T] + [
    df1.groupby(c)['user_id'].count() for c in df1.columns if df1[c].dtype == 'object'
]

for i in summary_list:
    print(i)
    print()

              count           mean           std       min        25%  \
user_id    294480.0  787973.538896  91210.917091  630000.0  709031.75   
converted  294480.0       0.119658      0.324562       0.0       0.00   

                50%        75%       max  
user_id    787932.5  866911.25  945999.0  
converted       0.0       0.00       1.0  

group
control      147202
treatment    147278
Name: user_id, dtype: int64

landing_page
new_page    147241
old_page    147239
Name: user_id, dtype: int64

In [ ]:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294480 entries, 0 to 294479
Data columns (total 5 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   user_id       294480 non-null  int64 
 1   timestamp     294480 non-null  object
 2   group         294480 non-null  object
 3   landing_page  294480 non-null  object
 4   converted     294480 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 11.2+ MB

The data consists of 5 columns (user_id, timestamp, group, landing_page, converted) and 294,480 entries. The group column has 2 levels, control and treatment: control represents users who were not exposed to the new web page, and treatment represents users who were exposed to it. The landing_page column also has 2 levels, old_page and new_page. The converted column takes the values 0 and 1, where 1 indicates the user bought an item and 0 indicates the user did not.

Data Cleaning¶

Let's check whether there are any duplicated entries.

In [ ]:

df.duplicated().sum()

Out[ ]:

0

In [ ]:

# checking for duplicated user id
df.user_id.duplicated().sum()

Out[ ]:

3895

In [ ]:

# percentage of duplicated data
df.user_id.duplicated().sum()/len(df)*100

Out[ ]:

1.3226704699809835

In [ ]:

duplicated_ids = df.user_id[df.user_id.duplicated()]

df.query('user_id in @duplicated_ids').sort_values('user_id').head(10)

Out[ ]:

	user_id	timestamp	group	landing_page	converted
213114	630052	25:54.1	treatment	old_page	1
230259	630052	16:05.2	treatment	new_page	0
22513	630126	35:54.8	treatment	old_page	0
251762	630126	16:00.3	treatment	new_page	0
183371	630137	08:49.9	control	old_page	0
11792	630137	59:22.1	control	new_page	0
207211	630320	02:43.6	control	old_page	0
255753	630320	27:37.2	treatment	old_page	0
110634	630471	42:51.5	control	old_page	0
96929	630471	14:17.4	control	new_page	0

There are no fully duplicated entries, but there are 3,895 duplicated user_id values. Some of these users were exposed to both the new and old landing pages (e.g. user_id 630052) and some appeared in both the treatment group and the control group (e.g. user_id 630320).

In this experiment, we expect users in the control group to access only the old landing page and users in the treatment group to access only the new landing page. The mismatch may be due to an internal problem while the experiment was running. The duplicated user_id entries make up about 1.32% of the entire dataset. For the purpose of our analysis, we will drop the mismatched data to resolve the disparity.
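
The size of the mismatch can be measured directly by cross-tabulating group against landing_page. The sketch below is illustrative only: toy_df is a small hypothetical frame, not the campaign data, and the variable names are assumptions.

```python
import pandas as pd

# Hypothetical toy data standing in for the campaign file (not the real dataset)
toy_df = pd.DataFrame({
    "user_id":      [1, 2, 3, 3, 4, 5, 5],
    "group":        ["control", "treatment", "treatment", "treatment",
                     "control", "control", "treatment"],
    "landing_page": ["old_page", "new_page", "old_page", "new_page",
                     "old_page", "old_page", "new_page"],
    "converted":    [0, 1, 1, 0, 0, 0, 0],
})

# Cells control/new_page and treatment/old_page hold the mismatched rows
alignment = pd.crosstab(toy_df["group"], toy_df["landing_page"])
print(alignment)

# A row is aligned when control saw old_page or treatment saw new_page
expected_page = toy_df["group"].map({"control": "old_page", "treatment": "new_page"})
is_mismatched = toy_df["landing_page"] != expected_page
n_mismatched = int(is_mismatched.sum())
print(f"Mismatched rows: {n_mismatched} ({is_mismatched.mean():.2%})")
```

In the toy frame, user 5 appears in both groups with aligned rows, which is why a separate de-duplication on user_id is still needed after the mismatched rows are removed.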

Resolving Mismatched Entries¶

In [ ]:

# Excluding the mismatching data
data = df.query('(group == "treatment" and landing_page == "new_page") | (group == "control" and landing_page == "old_page")')

# Checking user_id duplicate
data.user_id.duplicated().sum()

Out[ ]:

In [ ]:

# Dropping the duplicated user ids
# (reassign instead of inplace=True so we do not modify a filtered copy in place)
data = data.drop_duplicates(subset="user_id")
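
After these two steps, each user should appear once and each group should map to a single landing page. A minimal sketch of such a check follows; clean_ab_data is a hypothetical helper and toy_df is made-up data, so treat it as an illustration of the idea rather than part of the original analysis.

```python
import pandas as pd

def clean_ab_data(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: keep aligned rows, then one row per user_id."""
    aligned = frame.query(
        '(group == "treatment" and landing_page == "new_page") or '
        '(group == "control" and landing_page == "old_page")'
    )
    # Returning a new frame avoids mutating a filtered copy in place
    return aligned.drop_duplicates(subset="user_id")

# Toy rows (not the campaign data): user 3 has a mismatched row, user 5 is duplicated
toy_df = pd.DataFrame({
    "user_id":      [1, 2, 3, 3, 5, 5],
    "group":        ["control", "treatment", "treatment", "treatment", "control", "control"],
    "landing_page": ["old_page", "new_page", "old_page", "new_page", "old_page", "old_page"],
    "converted":    [0, 1, 1, 0, 0, 1],
})

cleaned = clean_ab_data(toy_df)

# Sanity checks: unique users, and exactly one landing page per group
assert cleaned["user_id"].is_unique
assert (cleaned.groupby("group")["landing_page"].nunique() == 1).all()
print(len(toy_df), "->", len(cleaned), "rows")
```

Note that drop_duplicates keeps the first occurrence of each user_id by default, so which row survives depends on the row order of the input.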

Observed Conversion Rates¶

Distribution of Conversion By Web Page Type¶

In [ ]:

ax=sns.countplot(x="converted", data=data, hue="landing_page", palette=['blue','green'])
ax.set(title="Number of Conversion By Web Page Type")

Out[ ]:

[Text(0.5, 1.0, 'Number of Conversion By Web Page Type')]

The plot above indicates there is not much difference between the number of conversions generated by the old web page and the new web page. We can compute the conversion rate of each group for further exploration.

In [ ]:

# The probability of an individual converting given that an individual was in the control group
len(data.query('group=="control" and converted==1'))/len(data.query('group=="control"'))

Out[ ]:

0.1203863045004612

In [ ]:

# The probability of an individual converting given that an individual was in the treatment group
len(data.query('group=="treatment" and converted==1'))/len(data.query('group=="treatment"'))

Out[ ]:

0.11880724790277405

The users from the control group had a conversion rate of 12.03% and the users from the treatment group had a conversion rate of 11.88%. Users of the old web page have a slightly higher conversion rate than users of the new web page.
The new web page does not appear to outperform the old one. To check whether this difference is statistically significant, we will conduct a Chi-Square test for independence.
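
Because converted is a 0/1 column, its mean within each group is the conversion rate, so both rates and the group sizes can be obtained in one pass. The sketch below uses hypothetical toy data (not the campaign results); the standard error column is an extra added for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (not the campaign results): 1,000 users per group
toy = pd.DataFrame({
    "group": ["control"] * 1000 + ["treatment"] * 1000,
    "converted": [1] * 120 + [0] * 880 + [1] * 119 + [0] * 881,
})

# Mean of a 0/1 column = conversion rate; count = number of users in the group
summary = toy.groupby("group")["converted"].agg(rate="mean", users="count")

# Standard error of a proportion: sqrt(p * (1 - p) / n)
summary["std_err"] = np.sqrt(summary["rate"] * (1 - summary["rate"]) / summary["users"])
print(summary)
```

Comparing the gap between the two rates with their standard errors gives a rough sense of whether the difference could be noise, which the formal test then assesses.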

Chi-Square Test For Independence¶

Define The Test Hypotheses¶

To begin our test, we first define our null hypothesis and alternate hypothesis. We also define our alpha value, which will later be used in interpreting the result of our test.

In [ ]:

# Define the hypotheses & acceptance criteria for test
null_hypothesis = "There is no relationship between landing page type and conversion rate."
alternate_hypothesis = "There is a relationship between landing page type and conversion rate."
alpha = 0.05

In [ ]:

# Summarises our dataset to a 2x2 matrix for landing_page by conversion
matrix = pd.crosstab(data.landing_page, data.converted).values

# run the chi-square test (for a 2x2 table SciPy applies Yates' continuity correction by default)
chi2_statistic, p_value, dof, expected_values = chi2_contingency(matrix)

# print chi-square statistic
chi2_statistic

Out[ ]:

1.7053502645115002
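
To make the statistic less of a black box, it can be reproduced by hand from the expected counts under independence. The sketch below uses hypothetical 2x2 counts (not the campaign table) and checks the manual values against chi2_contingency, both with and without the continuity correction.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 counts (rows: new_page, old_page; columns: not converted, converted)
observed = np.array([[8810, 1190],
                     [8790, 1210]])

# Expected count per cell under independence: row total * column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Pearson statistic, then the Yates-corrected version
# (this simple form assumes every |observed - expected| exceeds 0.5, as it does here)
pearson = ((observed - expected) ** 2 / expected).sum()
yates = ((np.abs(observed - expected) - 0.5) ** 2 / expected).sum()

stat_default, _, dof, _ = chi2_contingency(observed)                # correction=True by default
stat_plain, _, _, _ = chi2_contingency(observed, correction=False)  # plain Pearson statistic
print(dof, round(pearson, 4), round(yates, 4))
```

The correction shrinks each observed count towards its expected value, so the corrected statistic is slightly smaller and the test slightly more conservative.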

In [ ]:

# print p-value
p_value

Out[ ]:

0.19158976298516003

In [ ]:

# find the critical value for our test
critical_value = chi2.ppf(1 - alpha, dof)

# print critical value
critical_value

Out[ ]:

3.841458820694124
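
Comparing the p-value with alpha and comparing the statistic with the critical value are two views of the same decision. The sketch below illustrates this using the statistic reported above; the variable names reject_by_p and reject_by_critical are assumptions introduced for the example.

```python
from scipy.stats import chi2

alpha = 0.05
dof = 1                               # (2 rows - 1) * (2 columns - 1)
chi2_statistic = 1.7053502645115002   # value reported above

# The p-value is the upper-tail probability of the statistic under the null
p_value = chi2.sf(chi2_statistic, dof)
# The critical value is the point that leaves alpha in the upper tail
critical_value = chi2.ppf(1 - alpha, dof)

# Both comparisons encode the same decision
reject_by_p = p_value < alpha
reject_by_critical = chi2_statistic > critical_value
print(round(p_value, 4), round(critical_value, 4), reject_by_p, reject_by_critical)
```

Either comparison is therefore sufficient on its own; reporting both mainly helps readers who are used to one convention or the other.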

Analysis of Results & Conclusion¶

The results obtained from the Chi-Square test above give us a p-value of 0.19 which is greater than our alpha value of 0.05 and a chi-square statistic of 1.71 which is less than our critical value of 3.84.
Hence we fail to reject the Null Hypothesis and conclude that there is no significant difference between the conversion rates of the old landing page and the new landing page.

In conclusion, our analysis finds no evidence that the new web page converts better, so we do not recommend replacing the old web page at this time.
We recommend that the company run the experiment again to collect more data and make sure users in each group only access the web page they are supposed to access. This will help avoid mismatched data and may provide more insight.
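
Before rerunning the experiment, it can help to estimate how many users per group would be needed to detect a difference of a given size. The sketch below uses the standard normal-approximation formula for comparing two proportions. The baseline is the control rate observed above; the lifts, the 80% power target, and the helper name users_per_group are assumptions chosen for illustration, not figures from the experiment.

```python
from math import ceil, sqrt
from scipy.stats import norm

def users_per_group(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate per-group sample size for a two-sided test of two proportions."""
    z_alpha = norm.ppf(1 - alpha / 2)   # critical z for the significance level
    z_beta = norm.ppf(power)            # z for the desired power
    p_bar = (p1 + p2) / 2               # pooled proportion under the null
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p1 - p2) ** 2)

baseline = 0.1204  # control conversion rate observed in this analysis
for lift in (0.0015, 0.005, 0.01):  # hypothetical absolute differences to detect
    print(f"lift={lift:.4f}: ~{users_per_group(baseline, baseline + lift):,} users per group")
```

The required sample grows roughly with the inverse square of the difference, so deciding in advance which lift is worth detecting is what determines how long the experiment needs to run.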